import pandas as pd
import numpy as np
1 Goal
Today I chose a dataset from Kaggle with air quality data for Delhi. I want to try to fit a normal distribution to some of the data.
df = pd.read_csv('data/day25/delhi_air_quality.csv')
df.head(5)
| | Date | Month | Year | Holidays_Count | Days | PM2.5 | PM10 | NO2 | SO2 | CO | Ozone | AQI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2021 | 0 | 5 | 408.80 | 442.42 | 160.61 | 12.95 | 2.77 | 43.19 | 462 |
| 1 | 2 | 1 | 2021 | 0 | 6 | 404.04 | 561.95 | 52.85 | 5.18 | 2.60 | 16.43 | 482 |
| 2 | 3 | 1 | 2021 | 1 | 7 | 225.07 | 239.04 | 170.95 | 10.93 | 1.40 | 44.29 | 263 |
| 3 | 4 | 1 | 2021 | 0 | 1 | 89.55 | 132.08 | 153.98 | 10.42 | 1.01 | 49.19 | 207 |
| 4 | 5 | 1 | 2021 | 0 | 2 | 54.06 | 55.54 | 122.66 | 9.70 | 0.64 | 48.88 | 149 |
# Get an understanding of the data
df.describe()
| | Date | Month | Year | Holidays_Count | Days | PM2.5 | PM10 | NO2 | SO2 | CO | Ozone | AQI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 | 1461.000000 |
| mean | 15.729637 | 6.522930 | 2022.501027 | 0.189596 | 4.000684 | 90.774538 | 218.219261 | 37.184921 | 20.104921 | 1.025832 | 36.338871 | 202.210815 |
| std | 8.803105 | 3.449884 | 1.118723 | 0.392116 | 2.001883 | 71.650579 | 129.297734 | 35.225327 | 16.543659 | 0.608305 | 18.951204 | 107.801076 |
| min | 1.000000 | 1.000000 | 2021.000000 | 0.000000 | 1.000000 | 0.050000 | 9.690000 | 2.160000 | 1.210000 | 0.270000 | 2.700000 | 19.000000 |
| 25% | 8.000000 | 4.000000 | 2022.000000 | 0.000000 | 2.000000 | 41.280000 | 115.110000 | 17.280000 | 7.710000 | 0.610000 | 24.100000 | 108.000000 |
| 50% | 16.000000 | 7.000000 | 2023.000000 | 0.000000 | 4.000000 | 72.060000 | 199.800000 | 30.490000 | 15.430000 | 0.850000 | 32.470000 | 189.000000 |
| 75% | 23.000000 | 10.000000 | 2024.000000 | 0.000000 | 6.000000 | 118.500000 | 297.750000 | 45.010000 | 26.620000 | 1.240000 | 45.730000 | 284.000000 |
| max | 31.000000 | 12.000000 | 2024.000000 | 1.000000 | 7.000000 | 1000.000000 | 1000.000000 | 433.980000 | 113.400000 | 4.700000 | 115.870000 | 500.000000 |
So there are four years of data available (2021 to 2024), with a recording every day: 1461 rows matches 3 × 365 + 366, since 2024 is a leap year. It would now be interesting to plot the PM2.5 column.
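A quick sketch of how the daily coverage could be checked. It runs on a synthetic stand-in frame (`demo`, hypothetical) with the same `Date`, `Month` and `Year` columns, and it assumes, as the `head()` output suggests, that `Date` holds the day of the month.

```python
import pandas as pd

# Synthetic stand-in for the real frame (hypothetical data, same column layout).
days = pd.date_range("2021-01-01", "2024-12-31", freq="D")
demo = pd.DataFrame({"Date": days.day, "Month": days.month, "Year": days.year})

# Rebuild a proper timestamp from the three integer columns.
# Assumption: 'Date' is the day of the month.
stamps = pd.to_datetime(
    demo[["Year", "Month", "Date"]].rename(
        columns={"Year": "year", "Month": "month", "Date": "day"}
    )
)

# Every calendar day between the first and last timestamp that has no row.
expected = pd.date_range(stamps.min(), stamps.max(), freq="D")
missing = expected.difference(pd.DatetimeIndex(stamps))

rows_per_year = demo.groupby("Year").size()
print(rows_per_year)
print("missing days:", len(missing))
```

Swapping `demo` for the real `df` would show whether any calendar day is absent or duplicated.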
df.columns
Index(['Date', 'Month', 'Year', 'Holidays_Count', 'Days', 'PM2.5', 'PM10',
       'NO2', 'SO2', 'CO', 'Ozone', 'AQI'],
      dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1461 non-null int64
1 Month 1461 non-null int64
2 Year 1461 non-null int64
3 Holidays_Count 1461 non-null int64
4 Days 1461 non-null int64
5 PM2.5 1461 non-null float64
6 PM10 1461 non-null float64
7 NO2 1461 non-null float64
8 SO2 1461 non-null float64
9 CO 1461 non-null float64
10 Ozone 1461 non-null float64
11 AQI 1461 non-null int64
dtypes: float64(6), int64(6)
memory usage: 137.1 KB
import altair as alt
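alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2.5'
)
The PM2.5 column does not plot. The likely cause is the dot in the column name: Altair (via Vega-Lite) treats a dot in a field name as access to a nested field, so 'PM2.5' is not looked up as a single column. Renaming the column avoids the problem.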
df = df.rename(columns={'PM2.5': 'PM2_5'})
# Trying again with the new column name
alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2_5'
)
That did the trick. PM2.5 levels are generally lowest from July to September and highest in December and January. There is, however, an outlier in June with a PM2.5 of 1000. PM10 has the same maximum of exactly 1000 in the summary table, so the instrument may not be able to read above that threshold.
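The seasonal pattern and the suspicious reading can also be checked numerically rather than read off the chart. The sketch below uses a tiny made-up frame (`demo`, hypothetical values) with the renamed `PM2_5` column, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Tiny made-up sample (hypothetical values) with the same column names.
demo = pd.DataFrame({
    "Month": [1, 1, 6, 6, 8, 8, 12, 12],
    "PM2_5": [210.0, 190.0, 60.0, 1000.0, 30.0, 34.0, 180.0, 220.0],
})

# Median per month is less sensitive to a single extreme reading than the mean.
monthly_median = demo.groupby("Month")["PM2_5"].median().sort_values()
print(monthly_median)

# Rows sitting exactly at the column maximum; a repeated round maximum
# would be consistent with a sensor ceiling, though it does not prove one.
at_max = demo[demo["PM2_5"] == demo["PM2_5"].max()]
print(at_max)
```

Run against the real `df`, the first result gives the cleanest months at the top and the second shows how many readings sit at the ceiling.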
2 Calculating the normal distribution for 2024 of PM2.5
import math
import matplotlib.pyplot as plt
df_2024 = df[df['Year'] == 2024]
# Drop the top 1% of PM2_5 readings, including the outlier at 1000.
# This is done before computing the mean and standard deviation so the
# outlier does not distort them.
df_2024 = df_2024[df_2024['PM2_5'] < df_2024['PM2_5'].quantile(0.99)]
def normal_pdf(x, mu=0, sigma=1):
    """Probability density of a normal distribution at x."""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)
# Mean of PM2.5 in 2024
mu = df_2024['PM2_5'].mean()
# Standard deviation of PM2.5 in 2024
sigma = df_2024['PM2_5'].std()
# Evenly spaced grid of x values (step 1) between the observed min and max.
# The recorded PM2_5 values are not evenly spaced, so they are not used directly.
xs = np.arange(df_2024['PM2_5'].min(), df_2024['PM2_5'].max())
# Density at each grid point
y = [normal_pdf(x, mu=mu, sigma=sigma) for x in xs]
# Plot the fitted density
plt.plot(xs, y)
plt.title("Normal distribution of PM2.5 in Delhi 2024")
plt.show()
[Figure: Normal distribution of PM2.5 in Delhi 2024]
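As a sanity check on `normal_pdf`, the same formula can be written with NumPy so it works on a whole array at once, and the area under the curve can be confirmed to be close to 1. This is a sketch rather than the notebook's own code: `normal_pdf_vec` is a hypothetical name, and the `mu` and `sigma` values are placeholders, not the fitted ones.

```python
import numpy as np

def normal_pdf_vec(x, mu=0.0, sigma=1.0):
    """Normal density evaluated elementwise on a NumPy array."""
    x = np.asarray(x, dtype=float)
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Placeholder parameters (illustrative only, not the fitted 2024 values).
mu, sigma = 80.0, 50.0

# Grid covering six standard deviations either side of the mean.
xs = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
ys = normal_pdf_vec(xs, mu=mu, sigma=sigma)

# Trapezoidal rule written out by hand; should be very close to 1.
area = float(np.sum((ys[1:] + ys[:-1]) / 2.0 * np.diff(xs)))
print(round(area, 6))
```

The vectorised form also removes the need for a Python loop when building the y values for the plot.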
3 Reflections
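We thus have a fitted probability density for PM2.5 in 2024. The height of the curve is a density, not a probability: the probability that PM2.5 falls within a given range is the area under the curve over that range. The fit is only approximate, since the summary statistics (a mean of 90.8 against a median of 72.1 over all four years) point to a right-skewed distribution.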
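To turn the density into an actual probability, one integrates it over a range, which for the normal distribution has a closed form via the error function. A minimal sketch, with `normal_cdf` and `prob_between` as hypothetical helper names and placeholder parameters in place of the fitted `mu` and `sigma`:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, using the error function."""
    return (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))) / 2.0

def prob_between(lo, hi, mu=0.0, sigma=1.0):
    """P(lo < X <= hi): area under the normal density between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# Placeholder parameters (illustrative only, not the fitted 2024 values).
mu, sigma = 80.0, 50.0

# Probability of a reading within one standard deviation of the mean.
p_one_sigma = prob_between(mu - sigma, mu + sigma, mu, sigma)
print(round(p_one_sigma, 4))
```

With the fitted values plugged in, `prob_between(a, b, mu, sigma)` would estimate how often PM2.5 lands between `a` and `b` under the normal model.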
Besides fitting the normal distribution, it could be interesting to use linear regression to approximate the PM2.5 level on any given day.
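A rough sketch of what that could look like, using an ordinary least-squares line from `np.polyfit` on synthetic data (the day index and PM2.5 values below are made up for illustration). A straight line in the day number will not capture the seasonal swing seen in the scatter plot, so this is only a starting point.

```python
import numpy as np

# Synthetic example (made-up numbers): day index and a noisy linear trend.
rng = np.random.default_rng(0)
day = np.arange(100, dtype=float)
pm25 = 120.0 - 0.5 * day + rng.normal(0.0, 5.0, size=day.size)

# Degree-1 polynomial fit = least-squares line; returns [slope, intercept].
slope, intercept = np.polyfit(day, pm25, deg=1)

def predict(d):
    """Predicted PM2.5 for day index d under the fitted line."""
    return slope * d + intercept

print(round(slope, 2), round(intercept, 1))
print(round(predict(50.0), 1))
```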